-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding relations #158
base: main
Are you sure you want to change the base?
adding relations #158
Conversation
… the same object, just the values need to be the same after to/from json
…t instantiate spangroups
…; base Relation class on storage of these names; define to and from_json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall design seems good to me! I don't quite understand why we need AnnotationName
classes though. What does the extra overhead of this class get us?
mmda/types/annotation.py
Outdated
# TODO[kylel] - when does this ever get called? infinite loop? | ||
return self.__getattribute__(field) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This SO answer has a good explainer. I don't think you want to keep this here--- __getattribute__
is called before __getattr__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -142,47 +178,38 @@ def __deepcopy__(self, memo): | |||
|
|||
@property | |||
def type(self) -> str: | |||
logging.warning(msg='`.type` to be deprecated in future versions. Use `.metadata.type`') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would not use the root logger. I would add
Logger = logger.getLogger(__file__)
after imports, and then call Logger.warning
instead.
mmda/types/annotation.py
Outdated
# TODO[kylel] - when does this ever get called? infinite loop? | ||
return self.__getattribute__(field) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably can remove it; this stack overflow post has a good explanation, but tl;dr is that __getattribute__
supersedes __getattr__
if defined, so you will never be in a situation where __getattribute__
should be called after __getattr__
.
I suspect that the function of this call here is to raise an error if field
is not defined; if that is the case, I would explicitly raise an AttributeError
instead with a more descriptive error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
boxes: List[Box], | ||
id: Optional[int] = None, | ||
doc: Optional['Document'] = None, | ||
field: Optional[str] = None, | ||
metadata: Optional[Metadata] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this PR specifically, but I think we should document these arguments in a docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea we'll need that
|
||
return doc | ||
|
||
def locate_annotation(self, name: AnnotationName) -> Annotation: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite sure on why we need AnnotationName
. If a name is just a combo of relation name + id, can we just expect a tuple of the two here? or require two args? Overall, having a class to just hold annotation names would just result in computational overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment below
Without the class, we would need to code somewhere how IDs are constructed in the library. For now, it's As well, we need some way to parse this ID for use in lookup of that specific element within a Document. I don't want |
def from_str(cls, s: str) -> 'AnnotationName': | ||
field, id = s.split('-') | ||
id = int(id) | ||
return AnnotationName(field=field, id=id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return AnnotationName(field=field, id=id) | |
return cls(field=field, id=id) |
Prevents issues when inheriting (if we ever decide to).
@@ -34,6 +35,22 @@ def warn_deepcopy_of_annotation(obj: "Annotation") -> None: | |||
warnings.warn(msg, UserWarning, stacklevel=2) | |||
|
|||
|
|||
class AnnotationName: | |||
"""Stores a name that uniquely identifies this Annotation within a Document""" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__slots__ = ("field", "id") | |
Speeds up class creation by roughly 20%:
>>> %timeit AnnotationName('relation', 1)
140 ns ± 0.391 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
# after adding slots
>>> %timeit AnnotationName('relation', 1)
115 ns ± 0.163 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
@kyleclo Sounds good! added two small suggestions to improve it, but otherwise ok to merge! |
This PR extends this library functionality substantially -- Adding a new Annotation type called Relation. A Relation is a link between 2 annotations (e.g. a Citation linked to its Bib Entry). The input Annotations are called
key
andvalue
.A few things needed to change to support Relations:
Annotation Names
Relations store references to Annotation objects. But we didn't want
Relation.to_json()
to also.to_json()
those objects. We only want to store minimal identifiers of thekey
andvalue
. Something short likebib_entry-5
orsentence-13
. We call these short stringsnames
.To do this, we added to Annotation class, an optional attribute called
field: str
which stores this name. It's automatically populated when you runDocument.annotate(new_field = list_of_annotations)
; each of those input annotations will have the new field name stored under.field
.We also added a method
name()
that returns the name of a particular Annotation object that is unique at the document-level. Names are a minimal class that basically stores.field
and.id
.In short, now after you annotate a Document with annotations, you can do stuff like:
Lookups based on names
To support reconstructing a Relation object given the names of
key
andvalue
, we need the ability to lookup those involved Annotations. We introduce a new method to enable this:to and from JSON
Finally, we need some way to serializing to JSON and reconstructing from JSON. For serialization, now that we have Names, this makes the JSON quite minimal:
Reconstructing a Relation from JSON is more tricky because it's meaningless without a Document object. The Document object must also store the specific Annotations correctly so we can correctly perform the lookup based on these Names.
The API for this is similar, but you must also pass in the Document object: